Speaker change detection using joint audio-visual statistics
Authors
Abstract
In this paper, we present an approach for speaker change detection in broadcast video using joint audio-visual scene change statistics. Our experiments indicate that joint audio-visual statistics achieve better recall without loss of precision compared with purely audio-domain approaches to speaker change detection.
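The fusion idea can be illustrated with a minimal sketch. The thresholds, the corroboration window, and the specific fusion rule below are illustrative assumptions, not the paper's actual method: a candidate speaker change is accepted on strong audio evidence alone, or on weaker audio evidence when a visual scene change occurs nearby, which is one way such a scheme could raise recall without hurting precision.

```python
def detect_speaker_changes(audio_scores, video_scores,
                           audio_thresh=0.6, video_thresh=0.5, window=2):
    """Flag a speaker change at time step t when the audio change score
    is high, or when a moderate audio score is corroborated by a visual
    scene change within `window` steps. All parameters are illustrative.
    """
    changes = []
    for t, a in enumerate(audio_scores):
        lo, hi = max(0, t - window), min(len(video_scores), t + window + 1)
        visual_nearby = any(v >= video_thresh for v in video_scores[lo:hi])
        # Strong audio evidence alone, or moderate audio evidence
        # supported by a nearby visual scene change.
        if a >= audio_thresh or (a >= 0.5 * audio_thresh and visual_nearby):
            changes.append(t)
    return changes
```

For example, a weak audio peak that coincides with a shot cut would be accepted here, while the same peak far from any visual change would be rejected.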
Similar resources
Perceptual interfaces for information interaction: joint processing of audio and visual information for human-computer interaction
We are exploiting the human perceptual principle of sensory integration (the joint use of audio and visual information) to improve the recognition of human activity (speech recognition, speech event detection and speaker change), intent (intent to speak) and human identity (speaker recognition), particularly in the presence of acoustic degradation due to noise and channel. In this paper, we pre...
Joint processing of audio and visual information for multimedia indexing and human-computer interaction
Information fusion in the context of combining multiple streams of data, e.g., audio streams and video streams corresponding to the same perceptual process, is considered in a somewhat generalized setting. Specifically, we consider the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of descriptors, e.g., speech recognition/transcription,...
Speaker Tracking Using an Audio-visual Particle Filter
We present an approach for tracking a lecturer during the course of his speech. We use features from multiple cameras and microphones, and process them in a joint particle filter framework. The filter performs sampled projections of 3D location hypotheses and scores them using features from both audio and video. On the video side, the features are based on foreground segmentation, multi-view fa...
Joint speaker segmentation, localization and identification for streaming audio
In this paper we investigate the problem of identifying and localizing speakers with distant microphone arrays, thus extending the classical speaker diarization task to answer the question “who spoke when and where”. We consider a streaming audio scenario, where the diarization output is to be generated in realtime with as low latency as possible. Rather than carrying out the individual segment...
Audio-visual speaker conversion using prosody features
The article presents a joint audio-video approach towards speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using the experimental data from the 3D BIWI Audiovisual corpus of Affective Communication, mapping functions are built between each two speakers in order to convert speaker-specific features: speech signal and 3D facial expressions. The...
Journal:
Volume, Issue:
Pages: -
Publication date: 2000